Professor Efron is to be congratulated for his innovative
and valuable contributions to large-scale multiple
testing. He has given us a very interesting article
with much material for thought and exploration.
The two-group mixture model (2.1) provides a con- 
venient and effective framework for multiple testing. 
The empirical Bayes approach leads naturally to the 
local false discovery rate (Lfdr) and gives the Lfdr 
a useful Bayesian interpretation. This and other re- 
cent papers of Efron raised several important issues 
in multiple testing such as theoretical null versus 
empirical null and the effects of correlation. Much 
research is needed to better understand these issues. 

Virtually all FDR controlling procedures in the 
literature are based on thresholding the ranked p- 
values. The difference among these methods is in 
the choice of the threshold. In multiple testing, typically
one first uses a p-value based method such as
the Benjamini-Hochberg procedure for global FDR 
control and then uses the Lfdr as a measure of signif- 
icance for individual nonnull cases. See, for example, 
Efron (2004, 2005). In what follows I will first discuss
the drawbacks of using p-values in large-scale
multiple testing and demonstrate the fundamental 
role played by the Lfdr. I then discuss estimation of 
the null distribution and the proportion of the non- 
nulls. I will end with some comments about dealing 
with the dependency. 

In the discussion I shall use the notation given in 
Table 1 to summarize the outcomes of a multiple 
testing procedure. 

With the notation given in the table, the false discovery
rate (FDR) is then defined as
$\mathrm{FDR} = E(N_{10}/R \mid R > 0)\,\Pr(R > 0)$,
where $N_{10}$ is the number of nulls that are rejected and
$R$ is the total number of rejections.

1. THE USE OF p-VALUES: VALIDITY 
VERSUS EFFICIENCY 

In the classical theory of hypothesis testing the 
p-value is a fundamental quantity. For example, the 
decision of a test can be made by comparing the p-value
with the prespecified significance level $\alpha$. In
the more recent large-scale multiple testing literature,
the p-value continues to play a central role. As
mentioned earlier, nearly all FDR controlling procedures
separate the nonnull hypotheses from the
nulls by thresholding the ordered p-values.

A dual quantity to the false discovery rate is the 
false nondiscovery rate $\mathrm{FNR} = E(N_{01}/S \mid S > 0)\,\Pr(S > 0)$.
In a decision-theoretical framework, a
natural goal in multiple testing is to find, among 
all tests which control the FDR at a given level, the 
one which has the smallest FNR. We shall call an 
FDR procedure valid if it controls the FDR at a
prespecified level $\alpha$, and efficient if it has the smallest
FNR among all FDR procedures at level $\alpha$. The
literature on FDR controlling procedures so far has 
focused virtually exclusively on the validity; the ef- 
ficiency issue has been mostly untouched. 

In a recent article, Sun and Cai (2007) considered
the multiple testing problem from a compound decision
point of view. It is demonstrated there that the p-value
is in fact not a fundamental quantity in large-scale
multiple testing; the local false discovery rate (Lfdr)
is. Thresholding the ordered p-values does not in
general lead to efficient multiple testing procedures.
The reason for the inefficiency of the p-value methods
can be traced back to Copas (1974), where a
weighted classification problem was considered. Co- 
pas (1974) showed that if a symmetric classification 
rule for dichotomies is admissible, then it must be 
ordered by the likelihood ratios, which is equivalent 
to being ordered by the Lfdr. Sun and Cai (2007) 
showed that, under mild conditions, the multiple 
testing problem is in fact equivalent to the weighted
classification problem. I will discuss below some of 
the findings in Sun and Cai (2007) and draw con- 
nections to the present paper by Professor Efron. 

The local false discovery rate, defined in (2.7), 
was first introduced in Efron et al. (2001) as the 
a posteriori probability of a gene being in the null 
group given the z-score z. The results in Sun and Cai 
(2007) show that the Lfdr is a fundamental quantity 
which can be used directly for optimal FDR control. 
By using the Lfdr directly for testing, the goals of 
global error control and individual case interpreta- 
tion are naturally unified. 

For convenience, in the following we shall work
with the marginal false discovery rate
$\mathrm{mFDR} = E(N_{10})/E(R)$ and the marginal false nondiscovery
rate $\mathrm{mFNR} = E(N_{01})/E(S)$. The mFDR is asymptotically
equivalent to the usual FDR under weak
conditions, $\mathrm{mFDR} = \mathrm{FDR} + O(m^{-1/2})$, where $m$ is
the number of hypotheses. See Genovese and Wasserman
(2002).

It is illustrative to first look at an example in
the so-called oracle setting, where in the two-group
mixture model (2.6) the proportion $p_0$, the density
$f_0$ of the null distribution and the density $f$ of the
marginal distribution are assumed to be known. In
this case, both the optimal threshold for the p-values
and the optimal threshold for the Lfdr values can be
calculated for any given mFDR level. We shall call
a testing procedure with the optimal cutoff the oracle
procedure. Suppose the z-values $z_1, \ldots, z_m$ come
from a normal mixture distribution with

(1) $f(z) = p_0\phi(z) + p_1\phi(z - \mu_1) + p_2\phi(z - \mu_2)$,

where $p_0 = 0.8$, $p_1 + p_2 = 0.2$. That is, in the two-group
model (2.6), the null distribution is $N(0, 1)$,
the distribution of the nonnulls is a two-component
normal mixture, and the total proportion of the nonnulls
is 0.2. Figure 1 compares the performance of
the p-value and Lfdr oracle procedures (see Sun and
Cai, 2007).
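
As a small illustration of the oracle setting, the sketch below evaluates the oracle Lfdr, taken here as $p_0 f_0(z)/f(z)$, under the mixture (1). The function names `phi` and `lfdr` and the particular values of $p_1$, $p_2$, $\mu_1$, $\mu_2$ are assumptions chosen for illustration; they are not taken from the numerical studies cited above.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def lfdr(z, p1, p2, mu1, mu2):
    """Oracle Lfdr(z) = p0 * f0(z) / f(z) under the normal mixture (1)."""
    p0 = 1.0 - p1 - p2
    null_part = p0 * phi(z)
    f = null_part + p1 * phi(z - mu1) + p2 * phi(z - mu2)
    return null_part / f

# Illustrative (hypothetical) configurations with mu1 = -3, mu2 = 3:
# a balanced alternative (p1 = p2) and an unbalanced one (p1 > p2).
for z in (-3.0, -2.0, 0.0, 2.0, 3.0):
    balanced = lfdr(z, 0.10, 0.10, -3.0, 3.0)
    unbalanced = lfdr(z, 0.18, 0.02, -3.0, 3.0)
    print(f"z = {z:+.1f}  balanced = {balanced:.3f}  unbalanced = {unbalanced:.3f}")
```

In the balanced case the Lfdr is symmetric in z, so ranking by Lfdr and ranking by two-sided p-value agree; in the unbalanced case the Lfdr at z = -2 is smaller than at z = 2 even though the two-sided p-values coincide.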

In Figure 1, panel (a) plots the mFNR of the two
oracle procedures as a function of $p_1$ in (1), where
the mFDR level is set at 0.10 and the means under
the alternative are $\mu_1 = -3$ and $\mu_2 = 3$. Panel (b)
plots the mFNR as a function of $p_1$ in the same
setting except that the alternative means are $\mu_1 = -3$
and $\mu_2 = 6$. In panel (c) we choose mFDR = 0.10,
$p_1 = 0.18$, $p_2 = 0.02$, $\mu_1 = -3$ and plot the mFNR as
a function of $\mu_2$. Panel (d) plots the mFNR as a
function of the mFDR level while holding $\mu_1 = -3$,
$\mu_2 = 1$, $p_1 = 0.02$, $p_2 = 0.18$ fixed.

It is clear from the plots that the p-value oracle
procedure is dominated by the Lfdr oracle procedure.
At the same mFDR level, the mFNR of the
Lfdr oracle procedure is uniformly smaller than the
mFNR of the p-value oracle procedure. The largest
difference occurs when $|\mu_1| < \mu_2$ and $p_1 > p_2$, where
the alternative distribution is highly asymmetric
about the null. When $|\mu_1| = |\mu_2|$, the mFNR remains
constant for the p-value oracle procedure, while
the mFNR for the Lfdr oracle procedure can be noticeably
smaller when $p_1$ and $p_2$ are significantly different,
in which case the nonnull distribution has a
high degree of asymmetry. The Lfdr oracle procedure
utilizes the distributional information of the
nonnulls, but the p-value oracle procedure does not.
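
The remark that the p-value oracle procedure has constant mFNR when $|\mu_1| = |\mu_2|$ can be checked numerically. The sketch below assumes two-sided p-values, so that the rejection region is $|z| > t$, and finds the cutoff by bisection under the further assumption that the mFDR decreases in $t$. The helper names `reject_prob`, `mfdr_mfnr` and `pvalue_oracle` are hypothetical.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def reject_prob(t, mu):
    """P(|Z| > t) when Z ~ N(mu, 1)."""
    return Phi(-t - mu) + 1.0 - Phi(t - mu)

def mfdr_mfnr(t, p1, p2, mu1, mu2):
    """mFDR and mFNR of the rule 'reject when |z| > t' under mixture (1)."""
    p0 = 1.0 - p1 - p2
    null_rej = p0 * reject_prob(t, 0.0)                               # share of E(N10)
    alt_rej = p1 * reject_prob(t, mu1) + p2 * reject_prob(t, mu2)
    mfdr = null_rej / (null_rej + alt_rej)
    alt_acc = (p1 + p2) - alt_rej                                     # share of E(N01)
    mfnr = alt_acc / (1.0 - null_rej - alt_rej)
    return mfdr, mfnr

def pvalue_oracle(p1, p2, mu1, mu2, alpha=0.10):
    """Cutoff t with mFDR(t) = alpha (bisection; assumes mFDR decreases in t)."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mfdr_mfnr(mid, p1, p2, mu1, mu2)[0] > alpha:
            lo = mid
        else:
            hi = mid
    return hi, mfdr_mfnr(hi, p1, p2, mu1, mu2)[1]

# With mu1 = -3 and mu2 = 3 the result does not depend on how the
# nonnull proportion 0.2 is split between p1 and p2.
print(pvalue_oracle(0.10, 0.10, -3.0, 3.0))
print(pvalue_oracle(0.18, 0.02, -3.0, 3.0))
```

The two printed pairs agree up to rounding, because $P(|Z| > t)$ is the same under $N(-3, 1)$ and $N(3, 1)$; a symmetric rejection region cannot exploit the imbalance between $p_1$ and $p_2$.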

The Lfdr oracle procedure ranks the relative im- 
portance of the test statistics according to their like- 
lihood ratios. An interesting consequence of using 
the Lfdr statistic in multiple testing is that an ob- 
servation located farther from the null (i.e., a larger 
absolute z-value or equivalently a smaller p-value) 
may have a lower significance level. It is therefore 
possible that the test accepts a more "extreme" ob- 
servation while rejecting a less extreme observation, 
which implies that the rejection region is asymmetric.
This is not possible for a testing procedure based
on the individual p-values, whose rejection region is
always symmetric about the null. This can be seen
from Figure 2. The left panel compares the mFNR
of the p-value oracle procedure and Lfdr oracle procedure
and the right panel compares the rejection
region in the case of $p_1 = 0.15$. In this case the Lfdr
procedure rejects a z-value of -2 (Lfdr = 0.227,
p-value = 0.046) but not a z-value of 3 (Lfdr = 0.543,
p-value = 0.003). More numerical results are given in
Sun and Cai (2007). The results show that the Lfdr
oracle procedure dominates the p-value procedure in
all configurations of the nonnull hypotheses.

The difference between the two procedures can be 
even more striking when the alternative distribution 
$f_1$ is highly concentrated. In this setting, it is possible
that the extreme p-values near both 0 and 1
actually all come from the null distribution instead
of the nonnull distribution! In such a case, thresholding
the p-values fails completely as a method for
separating the nonnull hypotheses from the nulls.
In contrast, the Lfdr can still be effective in distin- 
guishing between the null and nonnull cases. 
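
A toy calculation makes the point concrete. The sketch below assumes a hypothetical nonnull density $f_1 = N(1, 0.1^2)$, a null $N(0, 1)$ with $p_0 = 0.8$, and one-sided p-values $1 - \Phi(z)$; none of these choices comes from the paper under discussion.

```python
import math

def normal_pdf(z, mean, sd):
    """Density of N(mean, sd^2) at z."""
    u = (z - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))

def one_sided_p(z):
    """One-sided p-value 1 - Phi(z) under the N(0, 1) null."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def lfdr(z, p0=0.8, alt_mean=1.0, alt_sd=0.1):
    """Lfdr(z) = p0 f0(z) / f(z) with a concentrated (hypothetical) alternative."""
    null_part = p0 * normal_pdf(z, 0.0, 1.0)
    alt_part = (1.0 - p0) * normal_pdf(z, alt_mean, alt_sd)
    return null_part / (null_part + alt_part)

for z in (-3.0, 1.0, 3.0):
    print(f"z = {z:+.1f}  p-value = {one_sided_p(z):.4f}  Lfdr = {lfdr(z):.3f}")
```

Under these assumptions the observations with p-values closest to 0 and to 1 (z = 3 and z = -3) have Lfdr essentially equal to 1, whereas z = 1, with an unremarkable p-value, has a much smaller Lfdr.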

In real applications, the proportion $p_0$ and the
density $f$ of the marginal distribution are unknown.
With a large number of observed z-values, both $p_0$
and $f$ can be estimated well from the data. In this
regard, the large-scale nature of the problem is a
blessing. The null distribution is more subtle. If all
the mathematical assumptions are satisfied, the theoretical
null distribution is true and thus can be used
to compute the Lfdr values. Otherwise, as argued
convincingly by Efron in Section 5 of the present paper,
the empirical null distribution should be used
and it can be estimated from the data. Among the
three quantities, $p_0$, $f_0$ and $f$, the marginal density
$f$ is relatively easier to estimate than $p_0$ and
$f_0$. Optimal estimation of these quantities is a challenging
problem. We shall discuss the estimation issue
in the next section. Let us assume for the moment
that we already have consistent estimators $\hat p_0$,
$\hat f_0$ and $\hat f$. Such consistent estimators are provided,
for example, in Jin and Cai (2007). Define the estimated
Lfdr by $\widehat{\mathrm{Lfdr}}(z_i) = [\hat p_0 \hat f_0(z_i)/\hat f(z_i)] \wedge 1$. Sun
and Cai (2007) introduced the following adaptive
step-up procedure:

(2) Let $k = \max\{ i : \frac{1}{i} \sum_{j=1}^{i} \widehat{\mathrm{Lfdr}}_{(j)} \le \alpha \}$,
then reject all $H_{(i)}$, $i = 1, \ldots, k$.
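
The step-up rule (2) is short enough to sketch directly. The code below reads $\widehat{\mathrm{Lfdr}}_{(j)}$ as the j-th smallest estimated Lfdr value and $H_{(i)}$ as the corresponding hypothesis; the function name `lfdr_step_up` and the input values are hypothetical, and the estimated Lfdr values are taken as given rather than computed from data.

```python
def lfdr_step_up(lfdr_hat, alpha):
    """Indices rejected by rule (2): the largest k whose k smallest
    estimated Lfdr values have running average <= alpha."""
    order = sorted(range(len(lfdr_hat)), key=lambda i: lfdr_hat[i])
    running_sum = 0.0
    k = 0
    for rank, idx in enumerate(order, start=1):
        running_sum += lfdr_hat[idx]
        if running_sum / rank <= alpha:
            k = rank  # running averages are nondecreasing, so the last hit is the max
    return sorted(order[:k])

# Hypothetical estimated Lfdr values for six hypotheses.
lfdr_hat = [0.02, 0.90, 0.05, 0.30, 0.60, 0.01]
print(lfdr_step_up(lfdr_hat, alpha=0.10))
```

With these inputs the running averages of the sorted values are 0.01, 0.015, about 0.027, 0.095 and then above 0.10, so four hypotheses are rejected. The case with estimated Lfdr 0.30 is among them: the rule controls the average Lfdr of the rejected set, not each individual value.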

It was shown that the data-driven procedure (2) controls
the mFDR at level $\alpha$ asymptotically and the
mFNR level of the adaptive procedure (2) is asymptotically
equal to the mFNR level achieved by the
Lfdr oracle procedure. In this sense, the adaptive
procedure (2) is asymptotically efficient. Numerical
studies in Sun and Cai (2007) show that this adaptive
procedure outperforms the step-up procedure
(Benjamini and Hochberg, 1995) and the adaptive
p-value based procedure (Benjamini and Hochberg,
2000; Genovese and Wasserman, 2004). The numerical
results are consistent with the theoretical arguments.
These results together show that the Lfdr,
not the p-value, is a fundamental quantity for large-scale
multiple testing.

Fig. 2. Symmetric rejection region versus asymmetric rejection region. Panel (a): comparison of oracle rules. Panel (b): rejection regions. In the mixture model (1), $\mu_1 = -3$ and $\mu_2 = 4$. Both
procedures control the mFDR at 0.10.

It is clear that the performance of the adaptive 
testing procedure (2) depends to a certain extent on 
the estimation accuracy of the estimators $\hat p_0$, $\hat f_0$ and
$\hat f$. This leads to the estimation issue, which will be
discussed next. 

2. ESTIMATING THE NULL DISTRIBUTION 
AND THE PROPORTION OF THE NONNULLS 

As demonstrated convincingly in this and other 
recent papers of Efron, the true null distribution of 
the test statistic can be quite different from the the- 
oretical null and two seemingly close choices of the 
null distribution can lead to substantially different 
testing results. This demonstrates that the problem 
of estimating the null density $f_0$ is important to simultaneous
multiple testing. In addition to the null
density $f_0$, the proportion of the nonnulls is another
important quantity. 

Conventional methods for estimating the null pa- 
rameters are based on either moments or extreme
observations. In the present paper, two methods, an-
alytical and geometric, for estimating the null den- 
sity are discussed. In addition, Efron (2004) sug- 
gested an approach which uses the center and half 
width of the central peak of the histogram for esti- 
mating the parameters of the null distribution. These 
methods are convenient to use. However, the prop- 
erties of these estimators are still mostly unknown. 
For example, the analytical method appears to be 
quite sensitive to the choice of the interval [a, b]. It 
is interesting to understand how the choice of [a, b] 
affects the resulting estimator $\hat f_0$, and more importantly
the outcomes of the subsequent testing procedures.

The three null density estimation methods men- 
tioned above rely heavily on the sparsity assumption 
which means that the proportion of nonnulls is small 
and most of the z-values near zero come from the 
nulls. In the nonsparse case these methods of esti- 
mating the null densities do not perform well and it 
is not hard to show that the estimators are generally 
inconsistent. 

Jin and Cai (2007) introduced an alternative frequency
domain approach for estimating the null parameters
by using the empirical characteristic function
and Fourier analysis. The approach demonstrates
that the information about the null is well preserved 
in the high-frequency Fourier coefficients, where the 
distortion of the nonnull effects is asymptotically 
negligible. The approach integrates the strength of 
several factors, including sparsity and heteroscedas- 
ticity, and provides good estimates of the null in a 
much broader range of situations than existing ap- 
proaches do. The resulting estimators are shown to 
be uniformly consistent over a wide class of param- 
eters and outperform existing methods in simula- 
tions. The approach of Jin and Cai (2007) also yields 
a uniformly consistent estimator for the proportion 
of nonnull effects. In a two-component normal mix- 
ture setting, Cai, Jin and Low (2007) proposed an 
estimator of the proportion and developed a mini- 
max theory for the estimation problem. 
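
The following sketch illustrates only the general idea behind a frequency domain approach; it is not the estimator of Jin and Cai (2007). It assumes a normal null $N(u_0, \sigma_0^2)$, whose characteristic function has logarithm $i u_0 t - \sigma_0^2 t^2/2$, and nonnull z-values with inflated variance, so that their contribution to the characteristic function is negligible at a moderately high frequency $t$. Differentiating the log of the empirical characteristic function then removes the unknown proportion and leaves $i u_0 - \sigma_0^2 t$. The simulated data, the fixed frequency t = 1.5 and all function names are hypothetical choices.

```python
import cmath
import random

def log_cf_derivative(z, t):
    """d/dt log(phi_n(t)), with phi_n the empirical characteristic function."""
    terms = [cmath.exp(1j * t * x) for x in z]
    phi = sum(terms) / len(z)
    dphi = sum(1j * x * e for x, e in zip(z, terms)) / len(z)
    return dphi / phi

def null_from_frequency(z, t):
    """Estimates of (u0, sigma0^2) for a N(u0, sigma0^2) null at frequency t."""
    d = log_cf_derivative(z, t)
    return d.imag, -d.real / t

# Simulated z-values: 80% from the null N(0.1, 1.2^2); the rest have
# random effects N(3, 2^2) plus the same noise (all values hypothetical).
random.seed(0)
m, p0, u0, s0 = 100_000, 0.8, 0.1, 1.2
z = []
for _ in range(m):
    if random.random() < p0:
        z.append(random.gauss(u0, s0))
    else:
        z.append(random.gauss(random.gauss(3.0, 2.0), s0))

u0_hat, var_hat = null_from_frequency(z, t=1.5)
print(u0_hat, var_hat)  # to be compared with u0 = 0.1 and s0**2 = 1.44
```

The choice of frequency involves a trade-off: a larger $t$ suppresses the nonnull contribution further but makes the empirical characteristic function noisier, which is why a data-driven choice of $t$ matters in practice.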

Much research is still needed in this area. In par- 
ticular, it is of significant interest to understand 
how well the null density can be estimated and how 
the performance of the estimators affects the per- 
formance of the subsequent multiple testing proce- 
dures. 

3. MODELING THE DEPENDENCY 

This paper also raised the important issue of the 
effects of correlation on outcomes of the testing pro- 
cedures. Observations arising from large-scale 
multiple comparison problems are often dependent. 
For example, different genes may cluster into groups 
along biological pathways and exhibit high correla- 
tion in microarray experiments. It is noted in this 
paper that correlation can considerably widen or 
narrow the null distribution of the z-values, and so
must be accounted for in deciding which hypotheses 
should be reported as nonnull. In fact, the notion 
of null distribution itself becomes unclear in the de- 
pendent case. 

The focus of previous research on the effects of 
correlation has been exclusively on the validity of 
various multiple testing procedures under depen- 
dency. For example, Benjamini and Yekutieli (2001) 
and Wu (2008) showed that the FDR is controlled 
at the nominal level by the step-up procedure (Ben- 
jamini and Hochberg, 1995) and the adaptive p- 
value procedure (Benjamini and Hochberg, 2000; 
Storey, 2002; Genovese and Wasserman, 2004) under
different dependency assumptions. While the validity
issue is important, the efficiency issue is arguably
more important.

Intuitively it is clear that the dependency struc- 
ture among hypotheses is highly informative in si- 
multaneous inference and can be exploited to con- 
struct more efficient tests. For example, in compara- 
tive microarray experiments, it is found that changes 
in expression for genes can be the consequence of 
regional duplications or deletions, and significant 
genes tend to appear in clusters. Therefore, when 
deciding the significance level of a particular gene, 
the observations from its neighborhood should also 
be taken into account. It is still an open problem how 
to accommodate the correlation for the construction 
of valid and efficient multiple testing procedures. 

4. CONCLUDING REMARKS 

The two-group mixture model and the empirical 
Bayes approach together provide a useful general 
framework for multiple testing. The Lfdr, not the 
p-value, is a fundamental quantity for large-scale 
multiple testing. The problem of estimating the null 
density and the proportion of the nonnulls is impor- 
tant to simultaneous multiple testing. This paper 
raises many important questions and will definitely 
stimulate new research in the future. I thank Pro- 
fessor Efron for his clear and imaginative work. 